<html>
<h2>Pipeline Processing Scripts</h2>

The AW package includes a set of scripts designed to facilitate batch processing for
transcript DE and ASE analysis. The scripts cover quality control, trimming, alignment
to reference, variant calling, quantification of expression, and removal of reference bias (for ASE studies).
The scripts call on a number of external tools (such as samtools and tophat2) which have
also been included in the package (in the "ext" directory). 
<p>
These batch functions are also available on the iPlant Atmosphere image, which you may
wish to use if your computational resources are limited. You can expand and reduce the image
size as needed for a pipeline step.

<p>
The batch scripts follow consistent conventions:
<ul>
<li> Running any script with no parameters gives its help text
<li> All scripts create a capitalized directory (e.g., TRIM) containing output
<li> The main outputs are in a subdirectory "Results" (e.g.,TRIM/Results)
<li> All scripts create a Summary.html with collected run statistics
<li> The primary input to one script is usually the Results directory from a prior
<li> Most scripts have a "-t" flag for number of threads/processes to use
<li> If you wish to make more detailed parameter changes or use different versions of
the underlying tools such as tophat, you can do this easily by editing variables at the top of the scripts
</ul>

<p><i>Important:</i>If your read files are paired, you must name them using suffixes to
identify the pairs. By default, the scripts look for suffixes "_R1", "_R2" to indicate
pairing; if you use a different convention then edit the variables "$fSuffix" and "$rSuffix"
at the top of the Align and Trim scripts. 
<p>
 In addition, for loading into the AW database, filenames
 need to contain labels identifying the sample and conditions; see the Files documentation</a>
 for details.
 
 <p>
The following table summarizes the batch scripts provided and the tools they call. (Before
running the pipeline, it is important also to familiarize yourself with the underlying tools
such as tophat). 
<p>

<table border=1 cellspacing=3 cellpadding=3>
<tr><td width=90><i>Script</i><td width=175><i>Input</i><td width=175><i>Output</i><td width=120><i>Tools Used<td><i>Postprocess</i>
<tr><td>1. GSmask<td>Genome sequence file, variant file (vcf)<td>Masked genome sequence
			<td>Bedtools <td> -
<tr><td>2. QC<td>NGS read files (fastq)<td><tt>HTML</tt> files with quality metrics.
			<td>FastQC <td> Unified HTML
<tr><td>3. Trim<td>NGS read files <td>Trimmed read files
			<td>Trimmomatic <td> -
<tr><td>4. Align <td>Trimmed read files, masked genome<td>alignment files (.bam)
			<td>Tophat2 <td> -
<tr><td>5. snpASE<td>Alignment files + variant file<td>Ref/alt coverage counts at each SNP location 
			(in .bed file format)<td>Samtools <td> Parse pileup to counts
<tr><td>6. transASE<td>Diploid transcript files (see below)<td>Ref/alt estimated expression levels
		for each transcript<td>STAR + eXpress <td> Create TCW counts
<tr><td>7. Variants<td>Alignment files<td>SNPs/Indels for each alignment, and combined
		<td>Samtools <td>Combine sample calls
</table>
<p>
The package also includes a utility script <tt>GSsplit.pl</tt> which splits a whole-genome fasta file into its separate chromosomes, for loading into AW. 


<h3>Input Files</h3>
The required input is:
<ul>
<li>Raw read files; see the Files documentation for <i>important</i> naming requirements.
<li>Genome annotation in GTF2.2 format (see <a href="http://mblab.wustl.edu/GTF22.html">http://mblab.wustl.edu/GTF22.html</a> for details),
along with the Files documentation.
<li>Genome sequence directory of chromosome FASTA files (see the Files documentation for details; if
your genome is in a single file, use the <tt>GSsplit.pl</tt> utility to split it).
</ul>
For ASE you will need also a variant file (<tt>.vcf</tt>); this may also be generated by calling denovo variants 
on the read set. (Note that indels may be loaded and viewed in AW, but only SNPs are used in the expression analysis.)  

<h3>Example Batch Pipeline Run</h3>

<ol>
<li>Start with fastq paired read files in directory <tt>reads/</tt>, genome fasta file(s) in directory <tt>genome/</tt>,
and annotation file <tt>annotation.gtf</tt>. We will assume that you also already have SNP calls 
<tt>snps.vcf</tt> (for denovo calling see below).  
<li>Start by checking the read quality:
<p>
<tt>scripts/QC.sh -d reads</tt>
<p>
<li>Study the <tt>html</tt> files in the resulting <tt>QC</tt> directory and decide on trimming parameters.
<li>Execute a trim, e.g.:
<p>
<tt>scripts/Trim.pl -d reads -H 5 -S 5:30 -p </tt>
<p>
<li>and then check the quality of the resulting reads:
<p>
<tt>scripts/QC.sh -d TRIM/Results</tt>
<p>
If the quality is not acceptable, re-run the trim with different parameters (rename the <tt>TRIM</tt>
the directory if you wish to preserve prior results).
<p>
<li>Next, for ASE studies it is important to reduce reference bias, which can be done by masking
the SNP loci in the genome before aligning reads:
<p>
<tt>scripts/GSmask.pl -i genome -v snps.vcf </tt>
<p>
<li>Now the masked genome files are in <tt>GSMASK/Results</tt>. Align the trimmed reads to the
genome, providing the annotation as guidance, and using 10 threads:
<p>
<tt>scripts/Align -g GSMASK/Results -r TRIM/Results -p -a annotation.gtf -t 10</tt>
<p>
This can take some time and generates bam alignment files in <tt>TOPHAT/Results</tt>, as 
well as an alignment summary <tt>TOPHAT/Summary.html</tt>. 
<li>Check the summary and if it seems
ok then proceed to generate the ASE SNP coverage counts; these measure ASE by counting the read
coverage for the ref/alt alleles at each SNP locus.
<p>
<tt>scripts/snpASE.pl -i TOPHAT/Results -v snps.vcf</tt>
<p>
This script produces count files in the <tt>SNPCOV/Results</tt> directory which have a <tt>.bed</tt>
format and may be loaded into AW.
<p>
<li>At this point you have everything needed for AW analysis of ASE. Build the AW project using <tt>runAW</tt>.
<p>
<li>When building the AW database, the SNP coverage is summed for each transcript, however 
you may also want a more explicit estimation of read coverage per transcript. 
For this purpose the <tt>eXpress</tt> package has been included, which takes into account isoforms,
and can also compute ASE if provided with the transcripts from the different alleles. 
<p>
<li>To run <tt>eXpress</tt>, first create the AW project first create the AW database 
using data generated so far. 
This automatically creates "ref" and "alt" transcript sets by using the given
SNP and annotation files. The transcript files are located 
in directory <tt>projects/&lt;project_name&gt;/AW_outputs</tt>, and are named
<tt>ntRef.fasta, ntAlt.fasta</tt>.  
<p>
<li>Once you have the generated transcript files, align the reads to them using the
 <tt>transASE.pl</tt> batch script. This 
script calls the <tt>STAR</tt> aligner to do the alignment, and then <tt>eXpress</tt> to quantify
the expression; ASE is seen as expression differences between ref and alt transcripts. 
(<tt>STAR</tt> is used because
it runs much faster than Bowtie with the lenient parameters which are recommended for <tt>eXpress</tt>.)
<p>
<tt>scripts/transASE.pl -t ntRef.fasta -u ntAlt -i TRIM/Results -p</tt>
<p>
<li>The result is expression files (<tt>.xprs</tt>) which can be loaded into AW for the 

second measure of ASE. A directory <tt>ResultsTCW</tt> is also created which contains
ordinary (non-ASE) expression quantification, and can be loaded into TCW for DE studies. 
</ol>


<h3>Batch Script Parameters and Versions of Tools</h3>

The batch scripts expose only a few of the underlying parameters of the tools which they use. 
If you wish to adjust more of these parameters, you can edit variables at the head
of each script. Also, if you wish to use a different tool version than supplied
in the <tt>ext</tt> directory, you can edit the tool path variables at the head of the scripts.

<h3>Variant Calling</h3>

If the strains you are working with do not have predetermined variants, then you will need to
call them denovo from your given dataset. For this purpose a script <tt>Variants.pl</tt> has
been provided, which uses the <tt>Samtools/Bcftools</tt> variant call functions. Typically you
will have one bam alignment file per sample, and the tool will call separately for each alignment
and then combine them into one using a simple threshold (by default, called in at least 5 samples):
<p>
<tt>scripts/Variants.pl -g genome -i TOPHAT/Results -t 01 -m 5 </tt>
<p>
The result is a file <tt>combined.vcf</tt> with the combined calls, as well as <tt>Results</tt> directory
containing the calls for each sample.
<p>
The script uses samtools to call variants separately for each alignment
bam file, and  then calls another script VarComb.pl to combine them
into a single output combined.vcf (you can run VarComb on its own
if desired). 
<p>
Here the -t and -m are VarComb parameters. 
-t indicates selecting for heterozygous variants, appropriate
to an F1 hybrid. If you have an inbred, choose -t 11 or leave off
the parameter to keep all variants.
-m sets the threshold number of samples need to call a combined snp. 
<p>
SNP-calling is highly dependent on project details, e.g. whether you have data for inbred (homozygous) lines,
whether the reference sequence matches at least one strain, etc. The provided scripts should prove useful
in many scenarios but will often need to be supplemented with other SNP manipulation 
programs such as <tt>vcftools</tt>. 

<h3>Supported Platforms</h3>

The external binaries supplied are 64-bit Linux and the pipeline has been 
tested on RedHat Enterprise Server 6.5 and CentOS 6.5. On other platforms 
the scripts should work, however you may need to supply some or all of
the external tools and alter the path variables in the script headers. 


</html>
